In this paper, which addresses the evaluation of measurement process quality, we focus
mainly on the evaluation procedure insofar as it is based on the numerical measurement
outcomes. We challenge the approach in which the "exact" value of the observed quantity is
compared to the error interval obtained from the measurements under test, and we propose a
procedure in which reference measurements are used as the "gold standard". To this end, we
designed a specific t-test procedure, which is explained here. We also describe and
discuss a numerical simulation experiment demonstrating the behaviour of our procedure.
1. Introduction

In an experimental context, a number of circumstances require the evaluation of a
factor influencing the quality of measurements, such as the apparatus, its calibration,
the context of the measurement set-up and the person(s) performing the measurement.
In this paper, we address the question of the optimal use of the measurement
outcomes in the evaluation process. Note that we do not exclude the use of
supplementary quality criteria in this evaluation process, but we are convinced that the
measurement results contain sufficient (complementary) information to justify a
more thorough study of their use.

We have chosen as an example the evaluation of measurements performed by students
in a student lab. A teaching assistant explains to a group of freshmen how
a given procedure needs to be performed to measure a specific physical quantity.
The students are asked to repeat this procedure a number of times, to calculate
from the measurement data a mean measurement value $m$, as well as an estimate
$s_m$ of the standard deviation of $m$, and to state in their report that they consider
the observed physical quantity $\mu$ to be characterized by an error interval $m \pm s_m$.

The evaluation of the students may include the observation by a teacher or a
teaching assistant of the actions of these students during the measurements, as
well as the assessment of written reports and/or oral tests. The outcomes of the
measurements are a valuable source of information regarding the performance of
the students. The measurement data are usually considered to be normally
distributed. Although it may be interesting to study the effect on the measurement
procedure of a violation of this simplifying assumption, we will assume that it is
correct. In the specific pedagogical setting, the exact value $\mu$ of the measured
quantity is assumed to be known. Whether or not this value lies
in the reported interval, or in a scaled version $m \pm a s_m$ thereof, is used as a criterion
to evaluate the quality of execution of the measurement procedure. We will show
in this paper that this type of approach needs to be challenged from a
statistical point of view, and we will propose an alternative for the assessment of
the measurement outcomes.



2. Theoretical considerations 

Observing a physical quantity by executing a measurement procedure $n$ times
is equivalent to drawing a random sample $\{x_1, x_2, \ldots, x_n\}$ from a population of
measurement data, distributed about an expectation $\mu$ (the actual value of the
measured quantity, treated as unknown) with a standard deviation $\sigma$. The first
steps of error analysis are:

• the calculation of an estimate $m$ for $\mu$, which is the arithmetic sample mean,

• the calculation of an estimate $s$ for $\sigma$:

$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - m)^2}{n - 1} $$

• the calculation of an estimate $s_m$ for the standard deviation $\sigma_m$ of the mean
value, which is $s_m = s/\sqrt{n}$ by virtue of the root-n law.
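The three steps above can be sketched in a few lines of Python. This is only an illustration: the function name `error_analysis` and the sample values are our own assumptions, not part of the procedure described in the text.

```python
import math
import statistics

def error_analysis(samples):
    """Return (m, s, s_m) for a list of repeated measurements (hypothetical helper)."""
    n = len(samples)
    m = statistics.fmean(samples)   # arithmetic sample mean, estimate of mu
    s = statistics.stdev(samples)   # sample standard deviation (n - 1 in the denominator)
    s_m = s / math.sqrt(n)          # root-n law: standard deviation of the mean
    return m, s, s_m

# Illustrative titration volumes in cm^3 (made-up values).
m, s, s_m = error_analysis([21.34, 21.36, 21.35])
print(f"m = {m:.4f}, s = {s:.4f}, s_m = {s_m:.4f}")
```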

In the measurement evaluation procedure mentioned in Section 1, comparing $\mu$
with the scaled error interval $m \pm a s_m$ is equivalent to performing a hypothesis test.
The sample under test consists of the $n$ measurement outcomes, the test variable
is $t_m = (m - \mu)/s_m$, and the null hypothesis $H_0$ is $E\{m\} = \mu$. Indeed, under this
hypothesis and the assumption of normality of the measurement outcomes,
$t_m$ is known to follow Student's t-distribution with $n - 1$ degrees
of freedom. The test consists of verifying whether $t_m \notin [-a, a]$ and rejecting the
null hypothesis in that case. This procedure boils down to verifying whether
$\mu \notin [m - a s_m, m + a s_m]$, in which case the validity of the measurements is rejected.
The significance level of the test is $1 - P_a$, where $P_a = P(t_m \in [-a, a])$. One
parameter that can be chosen is the sample size $n$. If this is relatively high (say,
$n > 30$), $s_m$ can be approximated by $\sigma_m$ and $t_m \approx (m - \mu)/\sigma_m$ approximately
follows a standard normal distribution. The parameter $a$ should be sufficiently
high to decrease the significance level. E.g. for high $n$ and $a = 1$, $P_1$ is only
0.68, which means that correctly performed measurements will only be accepted
as such with a probability of 68%! For $n < 30$ this figure is even worse. However,
increasing $a$ weakens the test. In many situations, it will be impossible to find a
satisfactory trade-off for the choice of $a$ that avoids the wrongful rejection of correct
measurements and at the same time yields a criterion to detect bad measurements
that has a sufficient sensitivity (the rightful rejection ratio), especially when $n$ is
small.
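To make the dependence of $P_a$ on $n$ concrete, the following sketch evaluates $P_a = P(t_m \in [-a, a])$ numerically. It is a minimal illustration using only the Python standard library: the helper names are hypothetical, and the t-density is integrated with Simpson's rule rather than taken from a statistics package.

```python
import math

def t_pdf(x, nu):
    """Density of Student's t-distribution with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))

def acceptance_probability(a, n, steps=2000):
    """P_a = P(-a <= t_m <= a) for n measurements, i.e. n - 1 degrees of freedom.

    Composite Simpson integration over [0, a], doubled by symmetry.
    """
    nu = n - 1
    h = a / steps
    total = t_pdf(0.0, nu) + t_pdf(a, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 2 * total * h / 3

for n in (3, 5, 10, 30, 1000):
    print(n, round(acceptance_probability(1.0, n), 4))
```

For $n = 3$ and $a = 1$ the acceptance probability is $1/\sqrt{3} \approx 0.58$, noticeably below the large-sample value of about 0.68.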

However, our main criticism does not concern the choice of $a$ (e.g. $a = 1$), but the
fact that the evaluation criterion depends only on the measurements under test.
Consequently, independently of the choice of $a$, an increased value of $s_m$, which
should be interpreted as a decrease in measurement quality, actually leads to an
increased acceptance of the measurements.
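This effect can be illustrated with a small Monte Carlo sketch. The parameter values below (a bias of 0.02, $n = 3$, $a = 1$, and the two spreads) are assumptions chosen for illustration only, and `acceptance_rate` is a hypothetical helper.

```python
import math
import random
import statistics

def acceptance_rate(mu_true, mu_t, sigma_t, n=3, a=1.0, trials=10000, seed=1):
    """Fraction of simulated samples whose interval m +/- a*s_m contains mu_true."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(trials):
        x = [rng.gauss(mu_t, sigma_t) for _ in range(n)]
        m = statistics.fmean(x)
        s_m = statistics.stdev(x) / math.sqrt(n)
        if abs(m - mu_true) <= a * s_m:
            accepted += 1
    return accepted / trials

# Same bias of 0.02 in both cases; only the spread of the measurements differs.
careful = acceptance_rate(21.35, 21.37, 0.01)
sloppy = acceptance_rate(21.35, 21.37, 0.05)
print(f"biased, sigma_T = 0.01: accepted {careful:.3f}")
print(f"biased, sigma_T = 0.05: accepted {sloppy:.3f}")
```

Under these assumptions the noisier measurements are accepted far more often than the precise ones, although both carry the same bias.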
3. Methods

The main problem is that the parameters $\mu$ and $\sigma^2$ characterizing correctly
acquired measurements are generally unknown. In the present paper, we propose a
methodology using the outcomes of a reliable reference measurement as ground
truth for a decision on the validity of the measurements under test. In our student
evaluation example, this could be realised by letting a skilled teaching assistant
or lab technician repeat the measurement process - say, $N$ times. This leads to
unbiased estimates $m_R$ and $s_R^2$ of the operational parameters $\mu$ and $\sigma^2$ respectively
(provided that the unit(s) of measurement are sufficiently refined for the effect of
discretisation to be negligible [1]) - the suffix R stands for "Reference". We will
denote by $m_T$ and $s_T^2$ the sample mean and variance of the measurements under
test (hence, suffix T). Before we describe the measurement evaluation procedure,
we formulate the following underlying assumptions:

• correctly acquired measurement data satisfy the normal distribution $N(\mu, \sigma^2)$,

• the set of reference measurement outcomes is considered to be a representative
sample from the population of "correctly acquired" measurements, i.e. $m_R$ and
$s_R^2$ are unbiased estimates of $\mu$ and $\sigma^2$, respectively.

For comparing the reference measurement data and test measurement data, two 
criteria are straightforward candidates: the mean value and the sample variance. 
In this section, we address both of them. In the remainder of the text, we mainly 
concentrate on the former criterion, because its use is less obvious. 



3.1 Evaluation on the basis of the mean value of the measurement data 

A sound measurement quality assessment procedure should wrongfully accuse
a measurement process of yielding bad outcomes only at a specified low rate of, e.g.,
one in one hundred on average. We designed the test variable of
a hypothesis test that leads to the definition of an acceptance interval for the
mean value of the measurement data under evaluation. The design is such that
this interval depends solely on the reference measurements and on the chosen
operational parameters $N$ and $n$ (the number of reference measurements and test
measurements, respectively). The hypothesis test is based on the following definition
of variable $t$:

$$ t = \sqrt{\frac{Nn}{N + n}}\;\frac{m_R - m_T}{s_R} \tag{1} $$

We can show (see Appendix A) that $t$ follows Student's t-distribution with $N - 1$
degrees of freedom under the general assumptions formulated earlier and under
the (null) hypothesis that the measurement process under test is correct, implying
that the resulting measurements satisfy $N(\mu, \sigma^2)$.
The acceptance interval can be derived from the relation between $t_{\alpha,N-1}$ and $\alpha$
in

$$ P\left(|t| < t_{\alpha,N-1}\right) = 1 - \alpha. \tag{2} $$

Replacing $t$ by its expression from Eq. (1), and reformulating the resulting
inequalities yields

$$ P\left(|m_T - m_R| < t_{\alpha,N-1}\, s_R \sqrt{\frac{1}{N} + \frac{1}{n}}\right) = 1 - \alpha, $$

which defines as acceptance interval for a given $\alpha$:

$$ m_T \in \left[\, m_R - t_{\alpha,N-1}\, s_R \sqrt{\frac{1}{N} + \frac{1}{n}},\;\; m_R + t_{\alpha,N-1}\, s_R \sqrt{\frac{1}{N} + \frac{1}{n}} \,\right] \tag{3} $$

The critical t-value $t_{\alpha,N-1}$ can be found from the cumulative probability function
of $t$ for $N - 1$ degrees of freedom - considering that the probability density function
is symmetric and consequently Eq. (2) is equivalent to:

$$ P\left(t < t_{\alpha,N-1}\right) = 1 - \alpha/2. $$

For a given $\alpha$, one can find $t_{\alpha,N-1}$ in a table of Student's t-distribution - see, e.g.,
Ref. Qj. 
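As an illustration of how the acceptance interval of Eq. (3) could be computed from reference data alone, the sketch below uses only the Python standard library. The critical value is obtained by numerically inverting the t-distribution instead of consulting a table. All function names and the reference values are assumptions made for this example.

```python
import math
import statistics

def t_cdf(x, nu, steps=2000):
    """CDF of Student's t (nu degrees of freedom) for x >= 0, via Simpson's rule."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))

    def pdf(u):
        return math.exp(log_c - (nu + 1) / 2 * math.log1p(u * u / nu))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3

def t_critical(alpha, nu):
    """Two-sided critical value: solves t_cdf(t) = 1 - alpha/2 by bisection."""
    target = 1 - alpha / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, nu) < target:   # expand until the root is bracketed
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def acceptance_interval(reference, n, alpha=0.05):
    """Interval for the test mean m_T, built from the reference data only."""
    N = len(reference)
    m_r = statistics.fmean(reference)
    s_r = statistics.stdev(reference)
    half_width = t_critical(alpha, N - 1) * s_r * math.sqrt(1 / N + 1 / n)
    return m_r - half_width, m_r + half_width

# Made-up reference titration volumes (cm^3), N = 10; students measure n = 3 times.
reference = [21.34, 21.35, 21.36, 21.35, 21.35, 21.34, 21.36, 21.35, 21.36, 21.34]
print(acceptance_interval(reference, n=3, alpha=0.05))
```

Because the interval depends only on the reference sample, $N$, $n$ and $\alpha$, it can be computed once and reused for every set of test measurements.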

The reader may wonder why we do not propose one of the classical t-tests for
testing the difference of mean values for independent samples. There exist two
variants of these tests: one in which the variances of the data in the two samples
are assumed equal and one where this assumption is not required. In general,
the former model clearly does not hold for the case where test measurements
need to be compared to reference measurements.

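A common formal expression for the second model is given by:

$$ t = \frac{m_R - m_T}{\sqrt{\dfrac{s_R^2}{N} + \dfrac{s_T^2}{n}}}, \qquad \text{where } df = \frac{\left(\dfrac{s_R^2}{N} + \dfrac{s_T^2}{n}\right)^2}{\dfrac{\left(s_R^2/N\right)^2}{N - 1} + \dfrac{\left(s_T^2/n\right)^2}{n - 1}} \tag{4} $$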



The specific equation for df is known as the Welch-Satterthwaite equation. 0] The 
model is used in studies where the variances of the underlying variable x are allowed 
to be different in the populations from which the two samples are drawn. We
note already that we dismiss this model as a basis for the evaluation of
measurements and refer to sections 4 and 5 for more details about our reasons for
doing so.
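For readers who want to reproduce the comparison, the statistic and the Welch-Satterthwaite degrees of freedom of Eq. (4) can be computed as in the following sketch. The function name `welch_t_and_df` is our own, and the sign convention $(m_R - m_T)$ follows Eq. (4).

```python
import math
import statistics

def welch_t_and_df(reference, test):
    """Return (t, df) of the unequal-variance t-test for two independent samples."""
    N, n = len(reference), len(test)
    vr = statistics.variance(reference) / N   # s_R^2 / N
    vt = statistics.variance(test) / n        # s_T^2 / n
    t = (statistics.fmean(reference) - statistics.fmean(test)) / math.sqrt(vr + vt)
    df = (vr + vt) ** 2 / (vr ** 2 / (N - 1) + vt ** 2 / (n - 1))
    return t, df

# Small made-up samples, only to show the call.
t, df = welch_t_and_df([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0])
print(f"t = {t:.4f}, df = {df:.4f}")
```

Note that `df` is generally not an integer, which is one sign that the t-distribution is only an approximation here.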



3.2 Evaluation on the basis of the variance of the measurement data 

Here, we can directly derive the acceptance interval for the variance from a
standard F-test. Under the null hypothesis that $s_R^2$ and $s_T^2$ have been calculated from
correct measurement data - i.e. two independent samples of data satisfying the
same normal distribution $N(\mu, \sigma^2)$ - $F = s_T^2/s_R^2$ is distributed according to the
F-distribution $F(n - 1, N - 1)$.

Assuming that reducing the quality of the measurements will increase the
variance of their outcomes, one can readily formulate the acceptance interval as

$$ s_T^2 \le F_{\alpha,n-1,N-1}\; s_R^2, $$

where the critical F-value $F_{\alpha,n-1,N-1}$ is derived from the cumulative probability
function:

$$ P\left(F < F_{\alpha,n-1,N-1}\right) = 1 - \alpha. $$

Sometimes, the nature of the measurement process requires considering the
possibility that the reduction of the measurement quality either increases or decreases
the variance of the outcomes - e.g. when the person performing the measurements
systematically tends to round the observed quantities to the same value. In this
case the acceptance interval should be:

$$ F_{1-\alpha/2,n-1,N-1}\; s_R^2 \;\le\; s_T^2 \;\le\; F_{\alpha/2,n-1,N-1}\; s_R^2. $$

4. Numerical Experiments 

We performed some numerical experiments to verify the theoretical
considerations in practice. Each experiment aimed at estimating the rejection ratio $\rho$ of
measurements satisfying one combination of $\mu_T$ and $\sigma_T$ parameter values, by repeating
the simulation of one measurement experiment a number of times with these
specific parameter values. For the estimation of $\rho$, we calculated the fraction $\rho_{pt}$ of
simulations where the outcomes of the (simulated) measurement experiment are
rejected (i.e. a point estimate), as well as an interval estimate $[\rho_{lo}, \rho_{hi}]$ of $\rho$ at a
95% confidence level. Evidently, for measurements from a correct measurement
process ($\mu_T = \mu$ and $\sigma_T = \sigma$), we expect that $\rho\ (= E\{\rho_{pt}\}) = \alpha$.

We considered as hypothetical measurement experiment a titration performed by
freshmen, where the volume of titrant necessary to neutralize a standardized quantity
of acid would be $\mu = 21.35$ cm$^3$. The reference measurements are produced by a
laboratory technician performing a titration, repeated $N = 10$ times, with an accuracy
characterized by $\sigma = 0.01$ cm$^3$. The mean result obtained by the technician is compared
to the mean result obtained by a student who repeats the titration $n = 3$ times and
measures according to parameters $\mu_T$ and $\sigma_T$. In case $\mu_T = \mu$ and $\sigma_T = \sigma$, we
are dealing with correctly performed measurements. If in that case the student's
mean titration volume falls outside the acceptance interval given by Eq. (3), we are
confronted with a wrongful rejection.

In order to perform an accurate estimation of the (rightful or wrongful) rejection
rate $\rho$ for different combinations of $\mu_T$, $\sigma_T$ and $\alpha$, we simulated $10^6$ independent
experiments, where each time the student's outcome is tested using the aforementioned
acceptance interval. The outcomes of our numerical experiments are summarized in
Table 1. In order to compare the criterion based on a scaled error interval to the one
presented here in terms of sensitivity, we repeated some of the experiments with the
same parameters, calculating the rejection ratio using the former criterion. The
outcomes of these numerical experiments are summarized in Table 2. A selection of the
numerical experiments reported in Table 1 (those for $\mu_T = 21.35$, $\sigma_T = 0.01$) has been
repeated, but with the criterion based on Eq. (4). The outcomes of these experiments
are summarized in Table 3.
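A reduced version of such a simulation can be sketched as follows. This is not the code used for the reported results: it runs far fewer trials than the $10^6$ mentioned above, and it hard-codes the tabulated two-sided 5% critical value for 9 degrees of freedom (2.262), so it is only valid for $N = 10$ and $\alpha = 0.05$. The function name and the seed are assumptions.

```python
import math
import random
import statistics

T_CRIT_9_DF = 2.262   # tabulated two-sided 5% critical value, 9 degrees of freedom

def rejection_rate(mu_t, sigma_t, mu=21.35, sigma=0.01, N=10, n=3,
                   t_crit=T_CRIT_9_DF, trials=10000, seed=1):
    """Estimate the rejection ratio of the test based on Eq. (1) by simulation."""
    rng = random.Random(seed)
    scale = math.sqrt(N * n / (N + n))
    rejected = 0
    for _ in range(trials):
        ref = [rng.gauss(mu, sigma) for _ in range(N)]        # technician
        test = [rng.gauss(mu_t, sigma_t) for _ in range(n)]   # student
        t = scale * (statistics.fmean(ref) - statistics.fmean(test)) / statistics.stdev(ref)
        if abs(t) > t_crit:
            rejected += 1
    return rejected / trials

h0_rate = rejection_rate(21.35, 0.01)       # correct measurements
biased_rate = rejection_rate(21.37, 0.01)   # bias of 0.02 cm^3
print(f"wrongful rejection ratio (H0): {h0_rate:.3f}")
print(f"rejection ratio with bias:    {biased_rate:.3f}")
```

With correct measurements the estimated rejection ratio should lie close to the chosen $\alpha = 0.05$, up to Monte Carlo noise.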

5. Discussion 

In subsection 3.1 we propose a method for the evaluation of a measurement process,
based on a non-standard Student's t-test, justified theoretically in Appendix A
and validated by the numerical experiments described in section 4. Our numerical
experiments compare the sensitivity of this methodology with the "classical"
approach based on a (scaled) error interval. In the latter approach the value of $a$
is chosen to yield a criterion equivalent to ours in terms of $\alpha$, the wrongful
rejection ratio. The experiments demonstrate that, with an appropriate choice of the
parameter $N$, the sensitivity of our approach is much higher than that of the (scaled)
error interval approach (Tables 1 and 2). Moreover, note that where a bad
measurement procedure only affects $\sigma_T$ (last row of Table 2), the sensitivity of
the error interval criterion does not even exceed $\alpha$, whereas ours is sensitive both to
bias and to high $\sigma_T$.

Table 1. Experiments simulating measurements by freshmen compared to a skilled lab
technician ($\mu_R = 21.35$, $\sigma_R = 0.01$, $N = 10$, $n = 3$). Results from $10^6$ simulations are a
point estimate $\rho_{pt}$ for $\rho$ and an interval estimate $[\rho_{lo}, \rho_{hi}]$ for it (confidence
level 95%), for an acceptance criterion given by Eq. (3).

Table 2. Experiments simulating measurements by freshmen and evaluated on the basis of
expression $\mu \in m \pm a s_m$ as acceptance criterion, where $a$ corresponds to
$P_a = 1 - \alpha = P(t_m \in [-a, a])$ as explained in section 2. Results from $10^6$ simulations
are a point estimate $\rho_{pt}$ for $\rho$ and an interval estimate $[\rho_{lo}, \rho_{hi}]$ for it
(confidence level 95%).

[Table body not recovered; surviving entries: 0.003770, 0.003652, 0.003892, 0.010, 0.019548]

Table 3. Experiments simulating measurements with the same parameters as for Table 1 and
satisfying $H_0$, i.e. $\mu_T = 21.35$, $\sigma_T = 0.01$, but with the t-test for the model given
by Eq. (4) as acceptance/rejection criterion.

The reader may wonder why we do not use one of the two classical t-tests for
independent samples. We briefly introduced this question at the end of
subsection 3.1 and mentioned the hypothesis test of the difference of mean values under
the assumption that the variances of the data in the two samples involved are
allowed to be different. This problem is known as the Behrens-Fisher problem. In
Ref. [H, it is already pointed out that for this problem "There is no completely 
satisfactory solution known". One very popular solution, found in most textbooks,
is the difference-of-means test for unequal population variances, using the
Welch-Satterthwaite equation - see Eq. (4) [4]. One should realize that $t$ in this equation
is generally only approximately distributed according to Student's t-distribution.
The shortcomings of the technique when both $N$ and $n$ are small but nevertheless
differ considerably are demonstrated by our numerical simulation
experiments (Table 3), where under the null hypothesis the wrongful rejection ratio is
systematically and significantly larger than the chosen significance level $\alpha$.

Moreover, the sample variances $s_R^2$ and $s_T^2$ are treated as equivalent in the
calculation of the variance estimate in the denominator of the expression for $t$. In our
case, we assume that $s_R^2$ is an unbiased estimator for $\sigma^2$, a parameter that (together
with the constants $N$ and $n$) fully determines the distribution of $(m_R - m_T)$ under
the null hypothesis. It is reasonable to assume that in most cases the variance of
the data from badly performed measurements is greater than the variance of the
outcomes of correctly performed measurements. In that case, treating the sample
variances as equivalent would boil down to weakening the test, especially
considering that normally $n < N$, which causes the term in $s_T$ to dominate the term
in $s_R$ in the denominator of the expression for $t$. In the test proposed here, we
decided not to incorporate $s_T$ in the expression for $t$ - see Eq. (1) - and to retain
only the "reference" sample standard deviation $s_R$. An additional advantage is that
this approach yields an acceptance interval that only needs to be calculated once
for the evaluation of measurement outcomes from several students, since this
interval depends only on the reference measurements by the laboratory technician.

The numerical simulation experiments demonstrate that for correct measurement
processes (Table 1 for $\mu_T = 21.35$ and $\sigma_T = 0.01$), the wrongful rejection ratio is
consistent with the chosen significance level and therefore perfectly controllable.
The effect on $\rho$ of a bias in the measurements under evaluation is demonstrated
with $\mu_T = 21.37$ and shows how this $\rho$ (this time the rightful rejection rate)
increases at the expense of an increasing $\alpha$. This means for our specific simulated
measurement example that if we consider such a bias of 0.02 sufficiently high for a
measurement process to be qualified as bad, and we are satisfied with identifying 77%
of the measurement processes affected by such bias, we must accept wrongfully
rejecting 5% of the correct measurement processes.

One may consider introducing additional information about measurement quality into
the procedure by combining the test on the basis of the mean value of the measurements
(subsection 3.1) with the one based on the variance (subsection 3.2), but then one
should take care to reduce the $\alpha$ of each test by one half to bring the significance
level of the combined test to (at most) $\alpha$ (Bonferroni correction).
Appendix A. Student evaluation based on the mean measurement value:
theoretical background

The goal of this appendix is to demonstrate that the expression for $t$ in Eq. (1)
follows Student's t-distribution with $N - 1$ degrees of freedom.
Let us start from the formal definition of a Student's t-distributed variable,
consisting of the following elements: 0]

(D1) If the random variable $X$ is normally distributed with mean 0 and variance
$\sigma^2$, and ...

(D2) ... if $Y^2/\sigma^2$ has a $\chi^2$ distribution with $\nu$ degrees of freedom, and ...

(D3) ... if $X$ and $Y$ are independent, ...

(D4) ... then $t = X\sqrt{\nu}/Y$ is distributed as a t-distribution with $\nu$ degrees of freedom.

As a linear combination of the normally distributed terms $m_R$ and $m_T$, the expression
$(m_R - m_T)$ follows a normal distribution. Since $m_R$ and $m_T$ have the same
expectation $\mu$, the expectation of $(m_R - m_T)$ is zero. Also, we are dealing with the
mean values of independent sets of measurement data. Therefore $m_R$ and $m_T$ are
statistically independent, resulting in the variance of $(m_R - m_T)$ being the sum of
the variances of $m_R$ and $m_T$. Formally:

$$ (m_R - m_T) : N\left(0,\; \sigma^2\left(\frac{1}{N} + \frac{1}{n}\right)\right) $$

This allows us to introduce a variable $X$ and to equate it to an expression that
satisfies element (D1) of the definition of the t-distribution, formulated earlier:

$$ X = \sqrt{\frac{Nn}{N + n}}\,(m_R - m_T) : N(0, \sigma^2). \tag{A1} $$

Let us now introduce a variable

$$ Y = \sqrt{\nu}\, s_R \quad \text{with } \nu = N - 1. \tag{A2} $$

From this definition and the fundamental properties of the sample variance of
a normally distributed variable, it follows that $Y^2/\sigma^2$ has a $\chi^2$ distribution with $\nu$
degrees of freedom, which satisfies element (D2) of the aforementioned definition.

Finally, note that $X$ and $Y$ are independent - definition element (D3) - since on
the one hand $m_R$ and $s_R$ are mutually independent as mean and variance of data
of the same sample, and on the other hand $m_T$ and $s_R$ are independent as statistics
of two independent sets of sample data.

This means that $t = X\sqrt{\nu}/Y$ follows the t-distribution according to definition
element (D4), and substituting $X$, $Y$ and $\nu$ in the latter expression for $t$, using
Eqs. (A1) and (A2), yields exactly Eq. (1).
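For completeness, the substitution can be written out step by step. This is our own expansion of the argument, using only the definitions (A1) and (A2) and element (D4):

```latex
\begin{align*}
t = \frac{X\sqrt{\nu}}{Y}
  &= \frac{\sqrt{\dfrac{Nn}{N+n}}\,(m_R - m_T)\,\sqrt{\nu}}{\sqrt{\nu}\, s_R} \\
  &= \sqrt{\frac{Nn}{N+n}}\;\frac{m_R - m_T}{s_R}
   = \frac{m_R - m_T}{s_R \sqrt{\dfrac{1}{N} + \dfrac{1}{n}}},
  \qquad \nu = N - 1 .
\end{align*}
```

The factor $\sqrt{\nu}$ cancels, so the unknown $\sigma$ never appears in $t$ and only the reference standard deviation $s_R$ remains in the denominator.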



